Get started with OpenAI's GPT-OSS, a duo of open-weight models built for transparency and control.
These versatile open-weight reasoning models are crafted for developers, researchers, and enterprises who require transparency, adaptability, and the freedom to customize - all while retaining powerful chain-of-thought reasoning capabilities. Both GPT-OSS models are trained to think step-by-step before producing an answer, enabling them to excel at complex reasoning tasks including coding challenges, mathematical problems, strategic planning, puzzles, and agent-based workflows. They also support adjustable reasoning effort levels, giving you fine-grained control to balance deeper reasoning performance with the computational cost that best fits your use case.

How to Use the GPT-OSS Inference API

These models are accessible exclusively to users on Build Tier 1 or higher. Because reasoning models generate longer, more detailed outputs through chain-of-thought processing, we recommend streaming tokens to ensure a smoother and more responsive user experience.
import requests

url = "https://platform.qubrid.com/api/v1/qubridai/chat/completions"
headers = {
    "Authorization": "Bearer Qubrid_API_KEY",  # replace with your Qubrid API key
    "Content-Type": "application/json"
}

data = {
    "model": "openai/gpt-oss-120b",
    "messages": [
        {
            "role": "user",
            "content": "Explain quantum computing to a 5 year old."
        }
    ],
    "temperature": 0.7,
    "max_tokens": 65536,
    "stream": False,  # set to True to stream tokens (recommended; see the sketch below)
    "top_p": 0.8
}

response = requests.post(url, headers=headers, json=data)
response.raise_for_status()
result = response.json()
A non-streaming request returns a complete response object like the one below:
{
    "id": "chatcmpl-4968cd8b586c496bbef3bba04edbbe75",
    "object": "chat.completion",
    "created": 1764850905,
    "model": "openai/gpt-oss-120b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "## Quantum Computing – A Primer\n\n### 1. What Is a Quantum Computer?\nA **quantum computer** is a device that performs computation using the laws of quantum mechanics instead of the classical physics that underlies today’s laptops and smartphones.  \nWhile a classical computer manipulates bits that are either **0** or **1**, a quantum computer manipulates **quantum bits** (or **qubits**) that can exist in a superposition of both states simultaneously. This property, together with **entanglement** and **interference**, lets a quantum processor explore many computational paths in parallel, potentially solving certain problems far faster than any classical machine.\n\n---\n\n### 2. Core Quantum Concepts\n\n| Classical Idea | Quantum",
                "refusal": null,
                "annotations": null,
                "audio": null,
                "function_call": null,
                "tool_calls": [],
                "reasoning_content": "We need to respond with an explanation of quantum computing. The system messages: we are ChatGPT, we have to follow policies. There's no disallowed content. Provide a clear, thorough explanation, possibly with analogies, basics of qubits, superposition, entanglement, quantum gates, algorithms, challenges, current state, future. Should be accessible but also technical. Use headings. Possibly ask follow-up. Provide references. Ensure no copyrighted text. Provide original content. Let's produce a comprehensive answer."
            },
            "logprobs": null,
            "finish_reason": "length",
            "stop_reason": null,
            "token_ids": null
        }
    ],
    "service_tier": null,
    "system_fingerprint": null,
    "usage": {
        "prompt_tokens": 75,
        "total_tokens": 331,
        "completion_tokens": 256,
        "prompt_tokens_details": null
    },
    "prompt_logprobs": null,
    "prompt_token_ids": null,
    "kv_transfer_params": null
}
To access just the chain-of-thought reasoning, read the reasoning_content property of the message:
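Using the result parsed above:

reasoning = result["choices"][0]["message"]["reasoning_content"]
print(reasoning)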
We need to respond with an explanation of quantum computing. The system messages: we are ChatGPT, we have to follow policies. There's no disallowed content. Provide a clear, thorough explanation, possibly with analogies, basics of qubits, superposition, entanglement, quantum gates, algorithms, challenges, current state, future. Should be accessible but also technical. Use headings. Possibly ask follow-up. Provide references. Ensure no copyrighted text. Provide original content. Let's produce a comprehensive answer.
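The example request above disables streaming so the full response object can be shown. Because streaming is recommended for reasoning models, here is a minimal sketch for consuming the stream, assuming the endpoint emits OpenAI-style "data: ..." server-sent events terminated by "data: [DONE]" (verify the chunk format against Qubrid's API reference):

import json
import requests

stream_data = {**data, "stream": True}  # reuse url, headers, and data from above
with requests.post(url, headers=headers, json=stream_data, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue  # skip keep-alives and blank separator lines
        payload = line[len(b"data: "):]
        if payload == b"[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)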

Available Models

Two capable open-weight models are available to meet different deployment needs.
GPT-OSS 120B:
  • Model String: openai/gpt-oss-120b
  • Hardware Requirements: Fits on a single 80GB GPU
  • Architecture: Mixture-of-Experts (MoE) with token-choice routing
  • Context Length: 128k tokens with RoPE
  • Best for: Enterprise applications requiring maximum reasoning performance
GPT-OSS 20B:
  • Model String: openai/gpt-oss-20b
  • Hardware Requirements: Runs within 16GB of memory, suitable for consumer hardware
  • Architecture: Optimized MoE for efficiency
  • Context Length: 128k tokens with RoPE
  • Best for: Research, development, and cost-efficient deployments
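Switching between the two models in the inference API is a one-line change to the example payload above:

data["model"] = "openai/gpt-oss-20b"  # smaller model for cost-efficient deployments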

GPT-OSS Best Practices

Reasoning models like GPT-OSS should be used differently from standard instruct models to get optimal results.
Recommended Parameters (an example payload follows this list):
  • Reasoning Effort: Use the adjustable reasoning effort levels to control computational cost vs. accuracy.
  • Temperature: Use 1.0 for maximum creativity and diverse reasoning approaches.
  • Top-p: Use 1.0 to allow the full vocabulary distribution for optimal reasoning exploration.
  • System Prompt: Provide the system prompt as a developer message; use it to give the model its instructions and to declare the available function tools.
  • System message: Avoid modifying the system message itself; it is used to specify the reasoning effort, meta information such as the knowledge cutoff, and built-in tools.
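Putting these recommendations together, an example payload (the reasoning_effort field name is an assumption based on OpenAI-style reasoning APIs; confirm the exact parameter name in Qubrid's API reference):

data = {
    "model": "openai/gpt-oss-120b",
    "messages": [
        {"role": "user", "content": "Outline a migration plan from REST to gRPC."}
    ],
    "temperature": 1.0,            # recommended: diverse reasoning approaches
    "top_p": 1.0,                  # recommended: full vocabulary distribution
    "max_tokens": 8192,
    "reasoning_effort": "medium"   # assumed field name; "low" | "medium" | "high"
}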
Prompting Best Practices: Think of GPT-OSS as a senior problem-solver – provide high-level objectives and let it determine the methodology:
  • Strengths: Excels at open-ended reasoning, multi-step logic, and inferring unstated requirements
  • Avoid over-prompting: Micromanaging steps can limit its advanced reasoning capabilities
  • Provide clear objectives: Balance clarity with flexibility for optimal results

GPT-OSS Use Cases

  • Code Review & Analysis: Comprehensive code analysis across large codebases with detailed improvement suggestions
  • Strategic Planning: Multi-stage planning with reasoning about optimal approaches and resource allocation
  • Complex Document Analysis: Processing legal contracts, technical specifications, and regulatory documents
  • AI Model Evaluation & Benchmarking: Sophisticated, contextually aware evaluation of other LLMs' responses, particularly useful in critical validation scenarios
  • Scientific Research: Multi-step reasoning for hypothesis generation and experimental design
  • Academic Analysis: Deep analysis of research papers and literature reviews
  • Information Extraction: Efficiently extracts relevant data from large volumes of unstructured information, ideal for RAG systems
  • Agent Workflows: Building sophisticated AI agents with complex reasoning capabilities
  • RAG Systems: Enhanced information extraction and synthesis from large knowledge bases
  • Problem Solving: Handling ambiguous requirements and inferring unstated assumptions
  • Ambiguity Resolution: Interprets unclear instructions effectively and seeks clarification when needed

Managing Context and Costs

Reasoning Effort Control

GPT-OSS features adjustable reasoning effort levels to optimize for your specific use case:
  • Low effort: Faster responses for simpler tasks with reduced reasoning depth
  • Medium effort: Balanced performance for most use cases (recommended default)
  • High effort: Maximum reasoning for complex problems requiring deep analysis. With this setting, also set max_tokens to roughly 30,000 so the extended chain of thought is not truncated (see the sketch after this list).
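A minimal sketch of this trade-off, reusing the assumed reasoning_effort field from above (only the ~30,000-token high-effort budget comes from the guidance above; the other budgets are illustrative starting points):

def effort_params(effort: str) -> dict:
    """Map a reasoning-effort level to illustrative request parameters."""
    budgets = {"low": 2048, "medium": 8192, "high": 30000}  # high ~30k per guidance above
    return {"reasoning_effort": effort, "max_tokens": budgets[effort]}

data.update(effort_params("high"))  # merge into the request payload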

Token Management

When working with reasoning models, it’s crucial to maintain adequate space in the context window:
  • Use max_tokens parameter to control response length and costs
  • Monitor reasoning token usage vs. output tokens - reasoning tokens can vary from hundreds to tens of thousands based on complexity (see the usage snippet after this list)
  • Consider reasoning effort level based on task complexity and budget constraints
  • Simpler problems may only require a few hundred reasoning tokens, while complex challenges could generate extensive reasoning
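For example, read the usage object from the parsed result above (the completion count typically includes reasoning tokens; whether they are reported separately depends on the endpoint):

usage = result["usage"]
print(f"prompt tokens:     {usage['prompt_tokens']}")
print(f"completion tokens: {usage['completion_tokens']}")  # typically includes reasoning tokens
print(f"total tokens:      {usage['total_tokens']}")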

Cost/Latency Optimization

  • Implement limits on total token generation using the max_tokens parameter
  • Balance thorough reasoning with resource utilization based on your specific requirements
  • Consider using lower reasoning effort for routine tasks and higher effort for critical decisions

Technical Architecture

Model Architecture

  • MoE Design: Token-choice Mixture-of-Experts with SwiGLU activations for improved performance
  • Expert Selection: Softmax-after-topk approach for calculating MoE weights, ensuring optimal expert utilization (illustrated after this list)
  • Attention Mechanism: RoPE (Rotary Position Embedding) with 128k context length
  • Attention Patterns: Alternating between full context and sliding 128-token window for efficiency
  • Attention Sink: Learned attention sink per-head with additional additive value in the softmax denominator
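To illustrate softmax-after-topk routing, a minimal NumPy sketch with toy dimensions (not the production implementation): the router first selects the k largest expert logits for a token, then normalizes with a softmax over only those selected logits.

import numpy as np

def moe_weights(router_logits: np.ndarray, k: int):
    """Softmax-after-topk: pick top-k experts, then softmax over only their logits."""
    topk_idx = np.argsort(router_logits)[-k:]        # indices of the k largest logits
    topk_logits = router_logits[topk_idx]
    w = np.exp(topk_logits - topk_logits.max())      # numerically stable softmax
    return topk_idx, w / w.sum()

logits = np.array([0.3, 2.1, -0.5, 1.7, 0.0])        # toy router output for one token
experts, weights = moe_weights(logits, k=2)
print(experts, weights)                              # weights sum to 1 over the chosen experts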

Tokenization

  • Standard Compatibility: Uses the same tokenizer as GPT-4o
  • Broad Support: Ensures seamless integration with existing applications and tools
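Assuming the GPT-4o tokenizer compatibility described above, you can estimate token counts locally with the open-source tiktoken package and its o200k_base (GPT-4o) encoding; treat the counts as estimates rather than exact parity with the hosted model:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o encoding
n_tokens = len(enc.encode("Explain quantum computing to a 5 year old."))
print(n_tokens)  # rough prompt-size estimate before calling the API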

Context Handling

  • 128k Context Window: Large context capacity for processing extensive documents
  • Efficient Patterns: Optimized attention patterns for long-context scenarios
  • Memory Optimization: GPT-OSS 120B is designed to fit efficiently within 80GB of GPU memory